Dataset: Financial Contributions to Presidential Campaigns (Ohio State)
Time: 2016
The reason to choose this dataset:
Ohio is known as a swing state, so its result is often read as an indicator of the overall election outcome.
## the total number of row in oh_data: 164475
## [1] "cmte_id" "cand_id" "cand_nm"
## [4] "contbr_nm" "contbr_city" "contbr_st"
## [7] "contbr_zip" "contbr_employer" "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt" "receipt_desc"
## [13] "memo_cd" "memo_text" "form_tp"
## [16] "file_num" "tran_id" "election_tp"
## [19] "party" "Month_Yr" "Day_Month"
## [22] "weekday" "surname" "gender"
## cmte_id cand_id cand_nm
## C00575795:71194 P00003392:71194 Clinton, Hillary Rodham :71194
## C00577130:34686 P60007168:34686 Sanders, Bernard :34686
## C00580100:24166 P80001571:24166 Trump, Donald J. :24166
## C00574624:16406 P60006111:16406 Cruz, Rafael Edward 'Ted':16406
## C00573519: 7937 P60005915: 7937 Carson, Benjamin S. : 7937
## C00581876: 4824 P60003670: 4824 Kasich, John R. : 4824
## (Other) : 5262 (Other) : 5262 (Other) : 5262
## contbr_nm contbr_city contbr_st
## STOWE, JANICE : 277 COLUMBUS : 17328 OH:164475
## MISSLER, ANDREW J. MR.: 203 CINCINNATI: 15630
## BRIONES, BERTA : 179 CLEVELAND : 5778
## MOESER, DONALD : 176 DAYTON : 4634
## CUMMINGS, JOHN : 142 TOLEDO : 3287
## SCHEEL, PATRICK : 133 AKRON : 3206
## (Other) :163365 (Other) :114612
## contbr_zip contbr_employer
## Min. : 10 RETIRED :27097
## 1st Qu.:431109498 N/A :22434
## Median :440942900 SELF-EMPLOYED : 8353
## Mean :368573923 NONE : 7638
## 3rd Qu.:450131451 INFORMATION REQUESTED: 7611
## Max. :458969665 (Other) :91213
## NA's :3 NA's : 129
## contbr_occupation contb_receipt_amt contb_receipt_dt
## RETIRED :43434 Min. :-10800 Min. :2014-07-17
## NOT EMPLOYED :10378 1st Qu.: 16 1st Qu.:2016-02-29
## INFORMATION REQUESTED: 7549 Median : 28 Median :2016-05-31
## ATTORNEY : 3320 Mean : 120 Mean :2016-05-16
## HOMEMAKER : 3234 3rd Qu.: 80 3rd Qu.:2016-08-25
## (Other) :96538 Max. : 29100 Max. :2016-11-28
## NA's : 22
## receipt_desc memo_cd
## :162495 :127925
## Refund : 887 X: 36550
## REDESIGNATION FROM PRIMARY: 211
## REDESIGNATION TO GENERAL : 210
## REATTRIBUTION TO SPOUSE : 114
## REATTRIBUTION FROM SPOUSE : 112
## (Other) : 446
## memo_text form_tp
## :114599 SA17A:128232
## * EARMARKED CONTRIBUTION: SEE BELOW: 33677 SA18 : 35356
## * HILLARY VICTORY FUND : 14385 SB28A: 887
## EARMARKED FROM MAKE DC LISTEN : 282
## *BEST EFFORTS UPDATE : 246
## REDESIGNATION FROM PRIMARY : 211
## (Other) : 1075
## file_num tran_id election_tp
## Min. :1003942 A80E77D0E713E417AA88: 3 : 522
## 1st Qu.:1077664 C11887628 : 3 G2016: 56271
## Median :1096260 C10225661 : 2 P2016:107682
## Mean :1095976 C10228611 : 2
## 3rd Qu.:1119042 C10230213 : 2
## Max. :1134173 C10234145 : 2
## (Other) :164461
## party Month_Yr Day_Month weekday
## Length:164475 2016-10:18582 Min. : 1.00 Monday :26927
## Class :character 2016-07:18208 1st Qu.: 8.00 Tuesday :29339
## Mode :character 2016-03:16599 Median :15.00 Wednesday:29176
## 2016-08:14777 Mean :16.04 Thursday :23544
## 2016-04:14059 3rd Qu.:25.00 Friday :24160
## 2016-02:13335 Max. :31.00 Saturday :16619
## (Other):68915 Sunday :14710
## surname gender.gender
## Length:164475 Length:164475
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
From the output and the variable definitions, I could identify the type of each variable and decide the next exploration step.
The oh_data dataset has 164,475 rows. After enrichment, there are 24 variables.
The key questions I would like to answer with this dataset are:
1) Is there any correlation between the contributed amount and the voting result?
2) Are there any patterns in who donates, e.g. by occupation, gender, or city of residence?
Key variable: donation amount (contb_receipt_amt, numeric variable)
Other numeric variables for exploring distribution: N/A
Some important non-numeric variables: candidate name (cand_nm), gender (gender), occupation (contbr_occupation), city (contbr_city), party (party)
## [1] -10800 29100
The distribution is widely spread, and there are some negative amounts due to refunds. To get a better view of the donation amounts, I applied a log base 10 (common logarithm) transformation to the plot. On the log scale, the most common donation amount is around US$50.1 (10^1.7).
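The transformation described above can be sketched in base R. This is a minimal illustration on made-up amounts, not the code used for the plot; the vector `amt` and the helper `back_to_dollars` are hypothetical names. It assumes refunds (zero or negative amounts) are dropped before taking the logarithm, since log10 is undefined for them.

```r
# Hypothetical sample of donation amounts in USD; the negative value is a refund.
amt <- c(-100, 16, 28, 50, 80, 2700)

# log10 is undefined for zero or negative amounts, so keep positive values only.
positive_amt <- amt[amt > 0]
log_amt <- log10(positive_amt)

# Convert a position on the log10 axis back to dollars.
back_to_dollars <- function(x) 10^x
peak_dollars <- back_to_dollars(1.7)  # roughly 50.1
```

Reading a peak at 1.7 on the log10 axis therefore corresponds to a donation of about US$50.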
## [1] COLUMBUS COLUMBUS CINCINNATI COLUMBUS COLUMBUS COLUMBUS
## [7] CINCINNATI AKRON DAYTON COLUMBUS COLUMBUS COLUMBUS
## [13] TOLEDO COLUMBUS COLUMBUS
## 1341 Levels: BATAVIA 45320 ABERDEEN ADA ADAMS COUNTY ADDYSTON ... ZOAR
## [1] COLUMBUS COLUMBUS CINCINNATI COLUMBUS COLUMBUS COLUMBUS
## [7] CINCINNATI AKRON DAYTON COLUMBUS COLUMBUS COLUMBUS
## [13] TOLEDO COLUMBUS COLUMBUS
## 10 Levels: COLUMBUS CINCINNATI CLEVELAND DAYTON TOLEDO ... LAKEWOOD
## 24 unique candidates
## 6555 unique occupations
## 1341 unique contributed cities
After plotting bar charts and counting the unique values of these non-numeric variables, I found that there are too many unique occupations and cities. The graphs are difficult to read if all occupations or cities are plotted, so I plotted the top 15 occupations and the top 10 cities that contributed the most funding.
In terms of candidates, there are only 24 unique candidates, so I used an abbreviation of each candidate’s name to plot a bar chart. From the bar chart, C.HR (Clinton, Hillary Rodham) received the largest contributed amount in Ohio.
Although there are more donation records for the Democratic party, the total donated amount is higher for the Republican party. This may be because the average donation to Republican candidates is higher.
The gender split is almost equal (female : male is around 1 : 1).
## # A tibble: 1,341 × 2
## contbr_city total_amount
## <fctr> <dbl>
## 1 CINCINNATI 2605688.7
## 2 COLUMBUS 2226563.1
## 3 CLEVELAND 866239.9
## 4 CHAGRIN FALLS 383091.9
## 5 DUBLIN 379636.9
## 6 SHAKER HEIGHTS 376150.9
## 7 AKRON 358729.6
## 8 DAYTON 353846.1
## 9 CANTON 277801.2
## 10 WESTERVILLE 254291.7
## # ... with 1,331 more rows
## # A tibble: 10 × 2
## contbr_city total_amount
## <fctr> <dbl>
## 1 CINCINNATI 2605688.7
## 2 COLUMBUS 2226563.1
## 3 CLEVELAND 866239.9
## 4 CHAGRIN FALLS 383091.9
## 5 DUBLIN 379636.9
## 6 SHAKER HEIGHTS 376150.9
## 7 AKRON 358729.6
## 8 DAYTON 353846.1
## 9 CANTON 277801.2
## 10 WESTERVILLE 254291.7
I listed the top 10 cities by number of donation records and by donation amount. Taking Columbus as an example, it has the most donation records of any city, but it does not rank first by donation amount. This suggests that some cities have a larger share of relatively small donations.
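The difference between ranking by records and ranking by amount can be illustrated with a small base R sketch. The data frame below is made up for illustration (only the column names follow the dataset), and `city_summary` is a hypothetical name.

```r
# Hypothetical donations; only the column names match the Ohio dataset.
donations <- data.frame(
  contbr_city = c("COLUMBUS", "COLUMBUS", "COLUMBUS",
                  "CINCINNATI", "CINCINNATI", "DAYTON"),
  contb_receipt_amt = c(10, 20, 30, 100, 200, 50),
  stringsAsFactors = FALSE
)

# Number of records and total amount per city.
records <- table(donations$contbr_city)
totals <- tapply(donations$contb_receipt_amt, donations$contbr_city, sum)

city_summary <- data.frame(
  contbr_city = names(totals),
  records = as.integer(records[names(totals)]),
  total_amount = as.numeric(totals),
  stringsAsFactors = FALSE
)
city_summary$avg_amount <- city_summary$total_amount / city_summary$records

# The city with the most records need not be the city with the largest total.
top_by_records <- city_summary$contbr_city[which.max(city_summary$records)]
top_by_amount <- city_summary$contbr_city[which.max(city_summary$total_amount)]
```

In this toy example the city with the most records has the lowest average donation, which is the same pattern described for Columbus.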
There are 164,475 observations in the Ohio dataset with 18 original variables. For analysis purposes, I added 6 extra variables (party, Month_Yr, weekday, Day_Month, surname and gender).
The main feature in the data set is “contb_receipt_amt”, together with the factors influencing the amounts. I’d like to find out which features have the most impact on raising larger contributed amounts, and I’d like to provide a few suggestions for future candidates running an election fund-raising campaign. I suspect city, occupation and day of week matter.
Since the 2016 American presidential election result has come out, it would be useful to compare the contributed amount data with the final voting result data. I downloaded the voting result data to analyze the correlation between contributed amount and voters in Ohio. (The analysis is covered in the next section.)
Yes, I created 6 variables for further analysis. They are listed below.
1) party: I categorized the data into 3 categories (Democratic, Republican, Other) based on candidate name
2) Month_Yr: for showing the contributed amount trend by month
3) weekday: for analyzing whether there is a large difference between weekdays and weekends
4) Day_Month: the day of the month
5) surname: used for predicting gender with the gender library
6) gender: the gender of the contributor
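A few of these derived variables can be sketched in base R as follows. This is an illustrative sketch, not the original enrichment code: the three-row data frame and the name `party_lookup` are hypothetical, and the lookup only covers the three candidate-to-party pairs that appear in the sample output of this report.

```r
# Hypothetical rows; the column names follow the dataset.
donations <- data.frame(
  cand_nm = c("Cruz, Rafael Edward 'Ted'", "Clinton, Hillary Rodham",
              "Sanders, Bernard"),
  contb_receipt_dt = as.Date(c("2016-02-29", "2016-05-31", "2016-08-25")),
  stringsAsFactors = FALSE
)

# Candidate-to-party lookup (assumption: limited to pairs shown in the report output).
party_lookup <- c(
  "Cruz, Rafael Edward 'Ted'" = "Republican",
  "Clinton, Hillary Rodham"   = "Democratic",
  "Sanders, Bernard"          = "Other"
)
donations$party <- unname(party_lookup[donations$cand_nm])

# Year-month label and day of month derived from the receipt date.
donations$Month_Yr <- format(donations$contb_receipt_dt, "%Y-%m")
donations$Day_Month <- as.integer(format(donations$contb_receipt_dt, "%d"))
```

A named character vector keeps the mapping in one place, so a candidate missing from the lookup shows up as NA rather than being silently assigned to a party.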
I enriched the Ohio dataset with zip code data to visualize the contributed amount on a map of Ohio. (The analysis is conducted in the multivariate plots section.)
After merging with the Ohio zip code data from the zipcode library, I found 83 records with potentially wrong zip codes, so I excluded them when plotting the contributed amount on the map. I excluded them because it is hard to identify the correct zip code based on city name alone.
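One way to prepare the zip codes before such a merge is sketched below in base R. It is a simplified stand-in, not the merge against the zipcode library described above: the helper `to_zip5` is hypothetical, and the validity check assumes that Ohio zip codes begin with 43, 44 or 45. The sample values are taken from the contbr_zip output in this report, where most values are 9-digit ZIP+4 codes stored as numbers.

```r
# Hypothetical helper: reduce a numeric ZIP or ZIP+4 code to its 5-digit form.
to_zip5 <- function(zip) {
  z <- sprintf("%.0f", zip)  # plain digits, no scientific notation
  zip5 <- substr(z, 1, 5)
  # Missing values or codes shorter than 5 digits cannot be valid zip codes.
  zip5[is.na(zip) | nchar(z) < 5] <- NA_character_
  zip5
}

raw_zip <- c(451359416, 432141210, 45249, 10, NA)
zip5 <- to_zip5(raw_zip)

# Rough plausibility check (assumption: Ohio zip codes start with 43, 44 or 45).
is_ohio <- !is.na(zip5) & substr(zip5, 1, 2) %in% c("43", "44", "45")
```

Records where `is_ohio` is FALSE would be candidates for exclusion from the map.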
## Source: local data frame [2,475 x 4]
## Groups: contbr_city [?]
##
## contbr_city party count total_amount
## <chr> <chr> <int> <dbl>
## 1 batavia Republican 1 500.00
## 2 45320 Republican 1 80.00
## 3 aberdeen Democratic 5 900.00
## 4 aberdeen Republican 2 44.00
## 5 ada Democratic 97 4272.00
## 6 ada Other 43 3682.88
## 7 ada Republican 18 1458.00
## 8 adams county Republican 1 80.00
## 9 addyston Democratic 11 392.55
## 10 addyston Republican 3 190.00
## # ... with 2,465 more rows
## contbr_city Democratic Other Republican
## 1 batavia 0.00 0.00 500
## 2 45320 0.00 0.00 80
## 3 aberdeen 900.00 0.00 44
## 4 ada 4272.00 3682.88 1458
## 5 adams county 0.00 0.00 80
## 6 addyston 392.55 0.00 190
I noticed that the positive relationship between the contributed amount and the number of voters is not strong. The relationship appears to be weak, which goes against my original assumption.
When looking at the relationship between the contributed amount and the total voters, Republican party supporters show a stronger correlation than Democratic party supporters.
The correlation coefficients between contributed amount and voter numbers are:
1) Republican party: 0.401
2) Democratic party: 0.184
The Republican coefficient is higher than the correlation coefficient between total contributed amount and total voter numbers in Ohio (0.307), while the Democratic coefficient is lower.
The total contributed amount and the Republican contributed amount are very strongly correlated (the correlation coefficient is 0.934) because the contributed amount from Republican party supporters accounts for ~60% of the total.
However, this is not a proper pair for checking the relationship because these 2 factors are not independent.
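The point about non-independent pairs can be illustrated with a short base R sketch. The figures below are invented for illustration and do not reproduce the coefficients reported above; the data frame `area` and its columns are hypothetical. `cor()` computes the Pearson correlation coefficient by default.

```r
# Hypothetical per-area figures (amounts in USD, votes as counts).
area <- data.frame(
  rep_amount = c(100, 200, 300, 400, 500),
  dem_amount = c(50, 40, 60, 45, 55),
  rep_votes  = c(1000, 1500, 1200, 2500, 2000)
)
area$total_amount <- area$rep_amount + area$dem_amount

# Correlation between two separately measured quantities.
cor_rep <- cor(area$rep_amount, area$rep_votes)

# Correlation between a total and one of its own components:
# it is high by construction, because the component is part of the total.
cor_total_rep <- cor(area$total_amount, area$rep_amount)
```

Because `total_amount` contains `rep_amount`, the second coefficient is close to 1 here regardless of any real-world relationship, which is why such a pair is not informative.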
## cand_nm contbr_city contbr_zip contb_receipt_amt
## 1 Cruz, Rafael Edward 'Ted' LEESBURG 451359416 25.00
## 2 Cruz, Rafael Edward 'Ted' MINERVA 446579402 25.00
## 3 Clinton, Hillary Rodham COLUMBUS 432141210 40.00
## 4 Sanders, Bernard COLUMBUS 432022420 50.00
## 5 Clinton, Hillary Rodham LEBANON 450365038 57.31
## 6 Sanders, Bernard CINCINNATI 45249 2.50
## party
## 1 Republican
## 2 Republican
## 3 Democratic
## 4 Other
## 5 Democratic
## 6 Other
## [1] 27392
## [1] 27309
## [1] 83
I noticed that the major cities account for a larger contributed amount. After visualizing the data on the map, it is clear that there are a few hot spots in Ohio.
After distinguishing the contributed amount by party, the map shows that more funding went to the Republican party, which is consistent with the voting result that the Republican party won Ohio.
No. I tried to build a linear regression model on numeric and categorical data, but it failed, and it seems to require a more complex statistical library.
The correlation coefficients between contributed amount and voter numbers are: 1) Republican party: 0.401 (plot 1-2) 2) Democratic party: 0.184 (plot 1-3)
The Republican coefficient is higher than the correlation coefficient between total contributed amount and total voter numbers in Ohio (0.307, plot 1-1), while the Democratic coefficient is lower.
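For reference, base R's `lm()` accepts a factor as a predictor and expands it into indicator (dummy) variables automatically. The sketch below uses invented amounts and is not a model fitted to the Ohio data; the log10 response is an assumption chosen to match the log-scaled plots in this report.

```r
# Hypothetical donations: three per party category.
donations <- data.frame(
  contb_receipt_amt = c(25, 50, 100, 10, 20, 40, 250, 500, 1000),
  party = factor(rep(c("Democratic", "Other", "Republican"), each = 3))
)

# Linear model with a numeric response and a categorical predictor.
# The first factor level ("Democratic") becomes the reference category.
fit <- lm(log10(contb_receipt_amt) ~ party, data = donations)
coefs <- coef(fit)
```

Each party coefficient is the difference in mean log10 amount relative to the reference level, so a value of 1 for `partyRepublican` corresponds to a tenfold higher typical donation in this toy data.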
Based on the analysis of contributed amount by weekday, the contributed amount is lower on weekends. This might be because people tend to reserve their weekends for family. I would suggest setting up campaign stops in places where people like to go with their families during the weekend. It might help to increase the funds raised on weekends.
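A weekday versus weekend comparison of this kind can be sketched in base R. The dates and amounts below are invented for illustration. The sketch uses the `wday` field of `as.POSIXlt()` (0 = Sunday, 6 = Saturday) rather than `weekdays()`, so the result does not depend on the system locale.

```r
# Hypothetical donations: Sat, Sun, Mon, Tue, Wed.
donations <- data.frame(
  contb_receipt_dt = as.Date(c("2016-02-27", "2016-02-28", "2016-02-29",
                               "2016-03-01", "2016-03-02")),
  contb_receipt_amt = c(10, 20, 50, 80, 40)
)

# Locale-independent day of week: 0 = Sunday, ..., 6 = Saturday.
wday <- as.POSIXlt(donations$contb_receipt_dt)$wday
day_type <- ifelse(wday %in% c(0, 6), "weekend", "weekday")

# Total contributed amount by day type.
totals <- tapply(donations$contb_receipt_amt, day_type, sum)
```

Since there are five weekdays and only two weekend days, comparing averages per day rather than raw totals would give a fairer picture.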
The map shows that the contributed money comes mainly from urban areas such as Columbus, Cleveland, Akron and Cincinnati. This helps candidates identify the cities to target in future campaigns to raise more funding.
I distinguished the funding for the Republican party and the Democratic party by color in Plot 3-1. It shows that there is more funding for the Republican party in Ohio, and the voting result also shows that the Republican party won Ohio.
Before starting the analysis, I assumed that the contributed amount would be a strong indicator of the election result. After analyzing the relationship between the election result in Ohio and the contributed amount data for Ohio, I found that the correlation coefficient between these 2 factors is lower than I expected, so the data do not show a strong correlation between contributed amount and voter numbers.
However, this analysis covers only one state. As a next step, I would suggest analyzing the data for all states in the U.S. to see whether there is a strong relationship between these 2 factors.
During the analysis, I struggled with the more than 6,000 occupations, in which I thought there might be some insights to be uncovered. It would be better if donors could choose from some default options when making a donation, such as “Retired”, “Public Servant”, “Military Soldiers” or “Teachers”. I could then cross-check against each party’s platform to see whether the platform has any impact on donation amounts by occupation.